1.  Introduction


Extracting particular pieces of information from a data set while maintaining both high
precision and recall is difficult and time-consuming. The need for faster information
extraction (IE) without significant loss of accuracy has led to the creation of automated IE
programs. Empirical evaluation plays a key role in estimating the performance of promising IE
tools.

General  Architecture  for  Text  Engineering  (GATE)  and  Automap  are  two  such  promising  IE 
tools;  the  former  was  developed  by  the  Natural  Language  Processing  (NLP)  Group  of  the 
University  of  Sheffield  and  the  latter  by  the  Center  for  Computational  Analysis  of  Social  and 
Organizational  Systems  (CASOS)  at  Carnegie  Mellon  University.  Both  programs  have  similar 
named  entity,  date,  and  location  extraction  capabilities. 

The two programs were compared on functionality, usability, and customization, with
empirical performance evaluation keyed to the one-dimensional metrics precision, recall, and the
F-measure. Each program was evaluated against three independent corpora: the Database
Creation for Information Processing Methods, Metrics, and Models (DCIPM3) message set, the
Soft Target Exploitation and Fusion Human Intelligence (STEF HUMINT) message set, and a
Google message set.

Extracted information can be visualized or formatted and stored as Resource Description
Framework (RDF) triples for later use in the construction of an ontology. The result would be an
information system that provides fast and actionable intelligence to the Warfighter.
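
As a rough illustration of the storage step, the sketch below writes one extracted entity as N-Triples lines. The namespace `http://example.org/ie/`, the predicate names `label` and `foundIn`, the document identifier, and the sample entity are all hypothetical; none of them come from GATE or Automap output.

```python
from urllib.parse import quote

BASE = "http://example.org/ie/"  # hypothetical namespace, for illustration only
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def entity_to_ntriples(text, entity_type, doc_id):
    """Return N-Triples lines describing one extracted entity."""
    subject = f"<{BASE}entity/{quote(text)}>"
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return [
        f"{subject} <{RDF_TYPE}> <{BASE}{quote(entity_type)}> .",
        f'{subject} <{BASE}label> "{escaped}" .',
        f"{subject} <{BASE}foundIn> <{BASE}doc/{quote(doc_id)}> .",
    ]


triples = entity_to_ntriples("Carnegie Mellon University", "Organization", "msg-001")
print("\n".join(triples))
```

Each line is a subject, predicate, and object terminated by a period, so the output can be appended to a file and loaded later by any tool that reads N-Triples.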


2.  Automated  IE  Tools 


2.1  GATE 

GATE v.5.0 is an open-source IE program. The program features a user-friendly graphical user
interface (GUI) that was used for this project.

2.1.1  Graphical  User  Interface  (GUI) 

Many of GATE’s features could be utilized via the GUI. A few syntactical and visual errors
were encountered when using the GUI, but overall it was a helpful aid to the extraction process.
Pipelines,* Processing Resources,† and Language Resources‡ could be loaded into GATE with
minimal effort using the GUI. Because plug-ins could be loaded into GATE through the plug-in
manager in the GUI, various resources could be brought in to assist the extraction effort.

The  GUI  allowed  a  corpus  to  be  saved  as  an  XML  document,  with  the  option  of  including 
annotations.  This  was  a  useful  capability. 

2.1.2  Annotations 

GATE displays extracted information by annotation. Each annotation is classified by type,
such as date, location, organization, or person. Each annotation includes features such as type,
gender, kind, and rules. A user can also manually add features to any annotation. These features
will appear alongside the annotations when saved as an XML document.

A predefined pipeline named ANNIE, which can be loaded into GATE, uses a variety of
processing resources located in GATE’s plug-in manager to automatically annotate a corpus.
ANNIE analyzes a corpus by tokenizing the text,§ running a gazetteer,** and transducing.†† Once
ANNIE has analyzed the text, the annotation types that have been created are compiled into an
annotation set. While viewing a document, a user can select the annotation types to be viewed.
The selected annotations will be highlighted in the text.
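
The three stages described above can be pictured with a toy Python sketch. This is not GATE's API or ANNIE's implementation; the gazetteer entries, the single date rule, and the sample sentence are invented for illustration, and a real pipeline uses far larger lists and grammars.

```python
import re

# Hypothetical gazetteer lists; a real gazetteer would be far larger.
GAZETTEER = {
    "Location": {"Sheffield", "Pittsburgh"},
    "Organization": {"CASOS"},
}

# Toy grammar rule: a day number, a month name, and a four-digit year.
DATE_RULE = re.compile(
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b"
)


def tokenize(text):
    """Split text into (token, start, end) tuples for words and punctuation."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def annotate(text):
    """Return annotations as dicts with type, start, end, and text."""
    annotations = []
    # Gazetteer step: label tokens that appear in a known list.
    for token, start, end in tokenize(text):
        for ann_type, names in GAZETTEER.items():
            if token in names:
                annotations.append(
                    {"type": ann_type, "start": start, "end": end, "text": token}
                )
    # Grammar step: label spans that match a pattern rather than a list entry.
    for m in DATE_RULE.finditer(text):
        annotations.append(
            {"type": "Date", "start": m.start(), "end": m.end(), "text": m.group()}
        )
    return sorted(annotations, key=lambda a: a["start"])


sample = "CASOS staff met in Pittsburgh on 12 March 2009."
annotations = annotate(sample)
for ann in annotations:
    print(ann["type"], ann["start"], ann["end"], ann["text"])
```

Storing character offsets with each annotation mirrors the idea of highlighting selected annotation types in the text: the offsets identify exactly which span to mark.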

A user can also create and edit annotations manually. This is a necessary step to calculate
precision and recall since manually annotated corpora, which serve as the ground truth, must be
compared to the automatically annotated corpora. Creating and editing annotations in GATE is a
long, grueling process, mainly because of a few specific yet annoying errors.

2.2  Automap 

Automap v.2.7.4 takes a different approach to automated IE than GATE. A corpus loaded into
Automap undergoes numerous preprocessing steps. Key preprocessing tools may include
stemming functions, deletion, thesauri, and removal of symbols, numbers, and punctuation. Once
preprocessing is complete, the user tags the remaining concepts in a meta-matrix thesaurus, then
selects and applies a sub-matrix. The extracted information is then ready for output.

2.2.1  Text  Preprocessing 

Application  of  a  delete  list  and  generalization  thesaurus  was  sufficient  to  streamline  the  data  for 
this  study. 

* An application built from different processing resources to process language resources.

† Sub-processes that can be used to build a pipeline.

‡ Corpora, single documents, etc.

§ Splits texts into tokens such as word, punctuation, space, etc.

** Sets of lists containing names of entities such as cities, organizations, days of the week, etc.

†† Implementing grammars (patterns) to define entities not defined by gazetteer.

2.2.1.1 Deletion. Deletion was a tricky process: clarity and concision had to be weighed
against losing potentially valuable information. The delete list included with Automap was a
very basic list of 35 words, the majority of which were pronouns and prepositions. A far larger,
customized delete list was written to pare the number of concepts. Since the project involved
training Automap on only one of the three corpora, the STEF HUMINT message set, the delete
list was aimed at removing all but military-relevant words.
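
A minimal sketch of applying a delete list is shown below. The word list is hypothetical; neither the 35-word list supplied with Automap nor the customized list written for this study is reproduced here, and Automap's own deletion options are not modeled.

```python
import re

# Hypothetical delete list, for illustration only.
DELETE_LIST = {"the", "a", "of", "in", "he", "she", "it", "was", "at"}


def apply_delete_list(text, delete_list=DELETE_LIST):
    """Remove delete-list words, keeping the remaining concepts in order."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return [w for w in words if w.lower() not in delete_list]


concepts = apply_delete_list("The convoy was seen at the checkpoint in the morning")
print(concepts)
```

The trade-off described above is visible even here: every word added to the list shortens the output, but a word deleted by mistake can never be tagged later.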

2.2.1.2 Thesauri. Two thesauri found in Automap were utilized. The first, the Generalization
Thesaurus, allowed two or more concepts to be identified as identical and provided a common
concept label. For instance, this function identified JCOC as the Joint Civilian Orientation
Conference and inserted a new symbol grouping, Joint Civilian Orientation Conference, into
the processed document. In this way, the concept is processed as a single concept and not three
pairs of concepts. The Generalization Thesaurus had several other uses: it (1) turned plural
concepts into their singular form, (2) converted dialectal spelling differences, and (3) changed
similar concepts into the same concept (e.g., “street,” “roadway,” “thoroughfare,” and
“throughway” were all deemed synonymous with “road”).
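
The generalization step can be sketched as a phrase-to-label substitution. The entries below are modeled on the examples in the text but are otherwise hypothetical, and joining the words of a label with underscores is an assumption made here so that the label behaves as a single token; it is not confirmed to be Automap's format.

```python
import re

# Hypothetical thesaurus entries modeled on the examples in the text.
GENERALIZATION_THESAURUS = {
    "JCOC": "Joint_Civilian_Orientation_Conference",
    "Joint Civilian Orientation Conference": "Joint_Civilian_Orientation_Conference",
    "streets": "road",
    "street": "road",
    "roadway": "road",
    "thoroughfare": "road",
    "throughway": "road",
    "colour": "color",
}


def generalize(text, thesaurus=GENERALIZATION_THESAURUS):
    """Replace each thesaurus key found in the text with its common label."""
    # Try longer keys first so multi-word phrases win over their parts.
    keys = sorted(thesaurus, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: thesaurus[m.group(1)], text)


generalized = generalize("The JCOC group crossed the roadway near two streets")
print(generalized)
```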

The second, and perhaps more important, thesaurus was the Meta-Matrix Thesaurus. Words are
tagged in this thesaurus according to the program’s embedded ontology; that is, this thesaurus
associates text-level concepts with meta-matrix concepts. Automap’s ontology offers several
classifications for any given concept, including knowledge, agent, resource, event, organization,
location, when, and attribute.
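
Tagging against the ontology amounts to a lookup from concept to class. In the sketch below the class names are those listed above, while the concept-to-class assignments and the sample concepts are hypothetical and do not come from the thesaurus built for this study.

```python
# Ontology classes listed in the text.
CLASSES = {"knowledge", "agent", "resource", "event",
           "organization", "location", "when", "attribute"}

# Hypothetical concept-to-class assignments, for illustration only.
META_MATRIX_THESAURUS = {
    "commander": "agent",
    "convoy": "resource",
    "checkpoint": "location",
    "ambush": "event",
    "monday": "when",
}


def tag_concepts(concepts, thesaurus=META_MATRIX_THESAURUS):
    """Pair each concept with its ontology class, or None if it is untagged."""
    tagged = []
    for concept in concepts:
        label = thesaurus.get(concept.lower())
        if label is not None and label not in CLASSES:
            raise ValueError(f"unknown ontology class: {label}")
        tagged.append((concept, label))
    return tagged


tagged = tag_concepts(["commander", "convoy", "checkpoint", "weather"])
print(tagged)
```

Concepts that come back with `None` are the ones the user has not yet tagged, which is where the manual effort in this step concentrates.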

Output  from  Automap  can  be  analyzed  at  any  of  three  levels:  (1)  the  concept  network  level,  (2) 
the  entire  meta-matrix  level,  or  (3)  the  sub-matrix  level.  The  sub-matrix  controls  how  the 
different  classes  of  Automap’s  ontology  interact  with  one  another.  The  full  sub-matrix  was 
selected  for  this  study  to  ensure  every  class  interaction  was  displayed. 


3.  Performance  Evaluation 


To objectively compare the two IE programs, the single-dimension quantitative metrics
precision, recall, and the traditional F-measure were selected. Precision is the proportion of
retrieved documents that are relevant to a user’s information needs; precision takes all retrieved
documents into account. It is defined as

Precision (P) = tp / (tp + fp),  (1)

where tp* and fp† are the numbers of true positives and false positives, respectively.


* Document is retrieved by system and is relevant.
† Document is retrieved by system and is not relevant.

Recall is the proportion of documents relevant to the user’s query that are successfully
retrieved; recall corresponds to the true positive rate. It is defined by


Recall (R) = tp / (tp + fn),  (2)

where tp is defined as in equation 1 and fn* is the number of false negatives.

The F-measure can be interpreted as the (equally) weighted harmonic mean of precision and
recall; the F-measure is defined as

F  =  2  •  [(P  •  R)  /  (P  +  R)].  (3) 

There are several methods to measure precision and recall depending on how strictly or leniently
partially correct true positives are taken into account. A partially correct true positive occurs when
a piece of information is not correctly extracted, such as not including a full name or adding words
that are not part of a name. The “strict”† method considers partially correct true positives as false
positives. The “lenient”‡ method considers partially correct true positives as true positives. The
third method uses the mean of the strict and lenient methods to compute the overall precision and
recall measures. The third method was selected to evaluate GATE and Automap.
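
The averaged method can be sketched in a few lines of Python. The function name and argument names are illustrative. Counting partially correct items in both denominators (as retrieved items for precision and as relevant items for recall) is an assumption about how the counts combine, not something stated explicitly in the text. The example counts are those of the GATE Date/when row for the DCIPM3 message set in table 2.

```python
def averaged_scores(correct, partial, false_neg, false_pos):
    """Return (precision, recall, f_measure) as the mean of strict and lenient."""
    retrieved = correct + partial + false_pos  # everything the system extracted
    relevant = correct + partial + false_neg   # everything in the ground truth

    def ratio(num, den):
        # Guard against empty denominators (nothing retrieved or nothing relevant).
        return num / den if den else 0.0

    strict_p = ratio(correct, retrieved)             # partial counted as wrong
    lenient_p = ratio(correct + partial, retrieved)  # partial counted as right
    strict_r = ratio(correct, relevant)
    lenient_r = ratio(correct + partial, relevant)

    precision = (strict_p + lenient_p) / 2
    recall = (strict_r + lenient_r) / 2
    f_measure = ratio(2 * precision * recall, precision + recall)
    return precision, recall, f_measure


# 58 correct, 0 partially correct, 2 false negatives, 0 false positives.
p, r, f = averaged_scores(58, 0, 2, 0)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints "1.000 0.967 0.983"
```

Averaging the strict and lenient scores is equivalent to giving each partially correct extraction half credit in the numerator.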


4.  Results 


Table  1  gives  a  summary  of  the  statistical  results. 


Table 1.  Precision, recall, and F-measure statistics for the three corpora.

                             GATE Version 5.0                 Automap Version 2.7
                       Precision  Recall  F-measure     Precision  Recall  F-measure
DCIPM3 Msg Set
  Date/when              1.000    0.967     0.983         1.000    0.484     0.652
  Location               0.750    0.050     0.094         0.333    0.033     0.061
  Organization           0.750    0.214     0.333         0.250    0.071     0.111
  Person/agent           0.550    0.668     0.604         0.475    0.340     0.400
Google Msg Set
  Date/when              0.989    0.989     0.989         0.662    0.179     0.282
  Location               0.966    0.664     0.787         0.954    0.838     0.893
  Organization           0.968    0.724     0.828         0.822    0.751     0.785
  Person/agent           0.648    0.805     0.718         0.687    0.658     0.672
STEF HUMINT Msg Set
  Date/when              0.690    0.794     0.738         0.675    0.214     0.325
  Location               0.947    0.471     0.629         0.680    0.439     0.533
  Organization           0.789    0.366     0.500         0.784    0.500     0.612
  Person/agent           0.333    0.497     0.358         0.818    0.248     0.381

* Document is not retrieved by system but is relevant.
† http://gate.ac.uk/sale/tao/split.html.
‡ http://gate.ac.uk/sale/tao/split.html.

The numbers of correct, partially correct, false negative, and false positive extractions used
for the precision, recall, and F-measure calculations are provided in table 2.

Automap was trained on the STEF HUMINT message set for this study. Since both the STEF
HUMINT and Google message sets featured a military-specific domain, Automap performed
reasonably well for most entity types. Automap performed poorly with respect to the DCIPM3
message set, which features a high school environment domain. GATE, however, did not appear
to be encumbered by the domain type and, in most cases, had higher or equivalent precision,
recall, and F-measure statistics when compared to Automap.


Table 2.  Correct (C), partially correct (PC), false negatives (FN), and false positives (FP).

                          GATE Version 5.0          Automap Version 2.7
                         C     PC    FN    FP        C     PC    FN    FP
DCIPM3 Msg Set
  Date/when             58      0     2     0       30      0    32     0
  Location               1      1    28     0        1      0    29     2
  Organization           1      1     5     0        0      1     6     1
  Person/agent          74    111     9    50        0    132    62     7
Google Msg Set
  Date/when            135      1     1     1       12     25   100     0
  Location             254      6   127     6      312     25    50     3
  Organization         237      8    88     4      201     98    34     5
  Person/agent          68     17    10    33       36     53     6     2
STEF HUMINT Msg Set
  Date/when             90     20    16    35       14     26    86     0
  Location             114      3   128     5       58     99    88     1
  Organization          14      2    25     3       15     11    15     0
  Person/agent          32     58    55    82       28     16   101     0

5.  Conclusions 


GATE has more potential than Automap as an automated IE program. GATE offers a fully
automated process, while Automap needs a fair amount of user definition. The performance
metrics for GATE, for the most part, are equivalent to or better than Automap’s. Automap
performed significantly worse at extraction when dealing with concepts it was not trained on,
which is a major drawback when dealing with domain-specific IE.